Case Study 3: Flagging Spam Emails

Team: Ana Glaser, Jake Harrison, Rob Burigo, and Yvan Sojdehei


Business Understanding

The company in question has workers receiving vast amounts of email every day. The objective of this case study is to classify whether an email is spam or work-related, streamlining inboxes to include only the important emails.


Modeling Preparations

Methods:

The methods our team applied to solve this problem are Naive Bayes and K-Means clustering. In preparation for modeling, the team parsed each segment of every email file into a structured form. Various natural language processing cleansing techniques were then applied to support the exploratory data analysis. Because the input data are raw email files, this approach let us feature-engineer the enriched data available within the file structure.

Evaluation Metrics:

The metrics our team utilized for this project are the F1 score and the confusion matrix for the Naive Bayes model.

Initially we intended to use accuracy and precision, but due to the imbalanced nature of the data we pivoted to the F1 score.

The main reason our team chose these metrics is that they evaluate model performance better under the disproportionate class distribution that exists within our dataset. F1 is the harmonic mean of precision and recall, which represents a more holistic view of the success of our Naive Bayes model.
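As a quick reference, F1 combines the two components as 2 × (precision × recall) / (precision + recall). A minimal sketch with made-up numbers (not from our results):

# Hypothetical precision/recall values, for illustration only
precision = 0.60   # 60% of emails flagged as spam are truly spam
recall = 0.40      # 40% of all spam emails get flagged
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.48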

Importing Packages

In [1]:
import os
import re
import email
import warnings
from collections import Counter
from email.parser import BytesParser, Parser
from email.policy import default
from html.parser import HTMLParser

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import statsmodels.api as sm
from bs4 import BeautifulSoup
from yellowbrick.model_selection import FeatureImportances

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Defining Functions to use within the Feature Engineering section.

In [2]:
def strip_html(text):
    """Strip HTML tags and return only the visible text."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()


def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stop_list:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    """Lemmatize each word as a verb (e.g. 'flagged' -> 'flag')."""
    lemmatizer = WordNetLemmatizer()
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

def normalize(words):
    """Run the full cleansing pipeline on a list of tokens, returning one string."""
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

stop_words = stopwords.words('english')

# Negation words to keep, since negations can carry meaning
customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Final stop list: the standard English stop words minus the negations above
stop_list = list(set(stop_words) - set(customlist))
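For illustration (not part of the original pipeline), the full cleansing pipeline behaves roughly like this:

sample = nltk.word_tokenize("The emails WERE Flagged as spam!")
print(normalize(sample))  # roughly: 'email flag spam'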

Data Evaluation / Engineering

Importing data. We consolidated all email files into two directories, spam and easy_ham (non-spam), for ease of processing.

In [3]:
spam_list = os.listdir("./spam/")
ham_list = os.listdir("./easy_ham/")

The first iteration cleans and combines the non-spam email file headers with their content.

In [4]:
folder=("./easy_ham")
files=os.listdir(folder)
emails=[folder+'/'+file for file in files]
In [5]:
my_dict_text = {"Words": [], "Type": []}

for email_path in emails:  # renamed from `email` to avoid shadowing the imported module
    with open(email_path, encoding='latin-1') as f:
        blob = f.read()
    my_dict_text["Words"].append(blob)
    my_dict_text['Type'].append(0)  # 0 = ham (not spam)
dataStackText = pd.DataFrame(my_dict_text)
In [6]:
my_dict = {"To": [], "From": [], "Subject": []}

for i in ham_list:
    with open("./easy_ham/"+i, 'rb') as fp:
        try:
            headers = BytesParser(policy=default).parse(fp)
            to_text = '{}'.format(headers['to'])
            from_text = '{}'.format(headers['from'])
            subject = '{}'.format(headers['subject'])    
            my_dict["To"].append(to_text)
            my_dict["From"].append(from_text)
            my_dict["Subject"].append(subject)
        except Exception:
            continue

dataStack = pd.DataFrame(my_dict)
In [7]:
EMAIL_DF = pd.merge(dataStack, dataStackText, how='left', left_index=True, right_index=True)  # positional join; assumes both passes saw the files in the same order
In [8]:
COMPLETE_EMAIL_DF = EMAIL_DF
COMPLETE_EMAIL_DF.shape
Out[8]:
(6940, 5)

The second iteration cleans and combines the spam email file headers with their content.

In [9]:
folder=("./spam")
files=os.listdir(folder)
emails=[folder+'/'+file for file in files]
In [10]:
my_dict_text = {"Words": [], "Type": []}

for email_path in emails:
    with open(email_path, encoding='latin-1') as f:
        blob = f.read()
    my_dict_text["Words"].append(blob)
    my_dict_text['Type'].append(1)  # 1 = spam
dataStackText = pd.DataFrame(my_dict_text)
In [11]:
my_dict = {"To": [], "From": [], "Subject": []}

for i in spam_list:
    with open("./spam/"+i, 'rb') as fp:
        try:
            headers = BytesParser(policy=default).parse(fp)
            #print(headers)
            to_text = '{}'.format(headers['to'])
            from_text = '{}'.format(headers['from'])
            subject = '{}'.format(headers['subject'])
            my_dict["To"].append(to_text)
            my_dict["From"].append(from_text)
            my_dict["Subject"].append(subject)
        except Exception:
            continue

dataStack = pd.DataFrame(my_dict)

We stacked the non-spam and spam datasets together to create one dataframe to analyze.

In [12]:
EMAIL_DF = pd.merge(dataStack, dataStackText, how='left', left_index=True, right_index=True)
In [13]:
COMPLETE_EMAIL_DF = COMPLETE_EMAIL_DF.append(EMAIL_DF)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
COMPLETE_EMAIL_DF.shape
Out[13]:
(9338, 5)
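The two passes above repeat the same parsing logic; a single helper along these lines (a sketch, not part of the original notebook) would avoid the duplication:

def load_emails(folder, label):
    """Parse headers and raw text for every email file in `folder`."""
    rows = {"To": [], "From": [], "Subject": [], "Words": [], "Type": []}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        with open(path, 'rb') as fp:
            headers = BytesParser(policy=default).parse(fp)
        with open(path, encoding='latin-1') as f:
            rows["Words"].append(f.read())
        rows["To"].append('{}'.format(headers['to']))
        rows["From"].append('{}'.format(headers['from']))
        rows["Subject"].append('{}'.format(headers['subject']))
        rows["Type"].append(label)
    return pd.DataFrame(rows)

# roughly equivalent to the merge-and-append steps above:
# COMPLETE_EMAIL_DF = pd.concat([load_emails("./easy_ham", 0), load_emails("./spam", 1)])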
In [14]:
COMPLETE_EMAIL_DF.head()
Out[14]:
To From Subject Words Type
0 Chris Garrigues <cwg-dated-1030314468.7c7c85@D... Robert Elz <kre@munnari.OZ.AU> Re: New Sequences Window Return-Path: <exmh-workers-admin@spamassassin.... 0
1 mkettler@home.com The Motley Fool <Fool@motleyfool.com> Personal Finance: Resolutions You Can Keep Return-Path: Fool@motleyfool.com\nDelivery-Dat... 0
2 Valdis.Kletnieks@vt.edu Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0
3 rod-3ds@arsecandle.org malcolm-sweeps@mrichi.com Malcolm in the Middle Sweepstakes Prize Notifi... Return-Path: <malcolm-sweeps@mrichi.com>\nDeli... 0
4 Robert Elz <kre@munnari.OZ.AU> Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0

We normalized the data for analysis using standard natural language processing cleansing techniques: stripping HTML, removing punctuation and stop words, and lowercasing all text.

In [15]:
# Removing any excess html
COMPLETE_EMAIL_DF['Words'] = COMPLETE_EMAIL_DF['Words'].apply(lambda x: strip_html(x))
COMPLETE_EMAIL_DF['Subject'] = COMPLETE_EMAIL_DF['Subject'].apply(lambda x: strip_html(x))
In [16]:
# Tokenizing each column
COMPLETE_EMAIL_DF['Words_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['Words']), axis=1)
COMPLETE_EMAIL_DF['Subject_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['Subject']), axis=1)
COMPLETE_EMAIL_DF['To_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['To']), axis=1)
COMPLETE_EMAIL_DF['From_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['From']), axis=1)
In [17]:
# normalizing each column

COMPLETE_EMAIL_DF['Subject_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['Subject_norm']), axis=1)
COMPLETE_EMAIL_DF['Words_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['Words_norm']), axis=1)
COMPLETE_EMAIL_DF['To_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['To_norm']), axis=1)
COMPLETE_EMAIL_DF['From_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['From_norm']), axis=1)
In [18]:
COMPLETE_EMAIL_DF.head()
Out[18]:
To From Subject Words Type Words_norm Subject_norm To_norm From_norm
0 Chris Garrigues <cwg-dated-1030314468.7c7c85@D... Robert Elz <kre@munnari.OZ.AU> Re: New Sequences Window Return-Path: \nDelivered-To: yyyy@localhost.ne... 0 returnpath deliveredto yyyy localhostnetnotein... new sequence window chris garrigues cwgdated10303144687c7c85 deepe... robert elz kre munnariozau
1 mkettler@home.com The Motley Fool <Fool@motleyfool.com> Personal Finance: Resolutions You Can Keep Return-Path: Fool@motleyfool.com\nDelivery-Dat... 0 returnpath fool motleyfoolcom deliverydate wed... personal finance resolutions keep mkettler homecom motley fool fool motleyfoolcom
2 Valdis.Kletnieks@vt.edu Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161835 2... new sequence window valdiskletnieks vtedu chris garrigues cwgexmh deepeddycom
3 rod-3ds@arsecandle.org malcolm-sweeps@mrichi.com Malcolm in the Middle Sweepstakes Prize Notifi... Return-Path: \nDelivered-To: rod@arsecandle.or... 0 returnpath deliveredto rod arsecandleorg recei... malcolm middle sweepstakes prize notification rod3ds arsecandleorg malcolmsweeps mrichicom
4 Robert Elz <kre@munnari.OZ.AU> Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161836 2... new sequence window robert elz kre munnariozau chris garrigues cwgexmh deepeddycom

Created a list of the most common spam words to leverage in the feature engineering functions.

In [19]:
folder=("./spam")
files=os.listdir(folder)
emails=[folder+'/'+file for file in files]
In [20]:
words = []
for email_path in emails:
    with open(email_path, encoding='latin-1') as f:
        blob = f.read()
    words += blob.split(" ")
words = to_lowercase(words)
words = remove_stopwords(words)
In [21]:
# Blank out non-alphabetic tokens, then drop them from the counts
for i in range(len(words)):
    if not words[i].isalpha():
        words[i] = ""
word_dict = Counter(words)
del word_dict[""]
In [22]:
# Keep only the 1,000 most frequent spam words
word_dict = word_dict.most_common(1000)
word_dict = [k for k, v in word_dict]

Now we have a dataframe containing both the original and the normalized email content.

In [23]:
COMPLETE_EMAIL_DF.head()
Out[23]:
To From Subject Words Type Words_norm Subject_norm To_norm From_norm
0 Chris Garrigues <cwg-dated-1030314468.7c7c85@D... Robert Elz <kre@munnari.OZ.AU> Re: New Sequences Window Return-Path: \nDelivered-To: yyyy@localhost.ne... 0 returnpath deliveredto yyyy localhostnetnotein... new sequence window chris garrigues cwgdated10303144687c7c85 deepe... robert elz kre munnariozau
1 mkettler@home.com The Motley Fool <Fool@motleyfool.com> Personal Finance: Resolutions You Can Keep Return-Path: Fool@motleyfool.com\nDelivery-Dat... 0 returnpath fool motleyfoolcom deliverydate wed... personal finance resolutions keep mkettler homecom motley fool fool motleyfoolcom
2 Valdis.Kletnieks@vt.edu Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161835 2... new sequence window valdiskletnieks vtedu chris garrigues cwgexmh deepeddycom
3 rod-3ds@arsecandle.org malcolm-sweeps@mrichi.com Malcolm in the Middle Sweepstakes Prize Notifi... Return-Path: \nDelivered-To: rod@arsecandle.or... 0 returnpath deliveredto rod arsecandleorg recei... malcolm middle sweepstakes prize notification rod3ds arsecandleorg malcolmsweeps mrichicom
4 Robert Elz <kre@munnari.OZ.AU> Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161836 2... new sequence window robert elz kre munnariozau chris garrigues cwgexmh deepeddycom

Feature Engineering:

As part of the analysis, we appended a series of features based on the frequency of common spam keywords.

Counting how many of the most common spam words appear in each email's subject and content.

In [24]:
spam_list_words = set(word_dict)  # set membership checks are O(1) vs O(n) for a list
COMPLETE_EMAIL_DF['spam_count_content'] = COMPLETE_EMAIL_DF['Words_norm'].apply(lambda x: sum(i in spam_list_words for i in x.split()))
COMPLETE_EMAIL_DF['spam_count_subject'] = COMPLETE_EMAIL_DF['Subject_norm'].apply(lambda x: sum(i in spam_list_words for i in x.split()))

Getting a capital letter count for each email's from address, subject, and content.

In [25]:
COMPLETE_EMAIL_DF['subject_cl_count'] = COMPLETE_EMAIL_DF['Subject'].apply(lambda x: sum(1 for c in x if c.isupper()))
COMPLETE_EMAIL_DF['from_cl_count'] = COMPLETE_EMAIL_DF['From'].apply(lambda x: sum(1 for c in x if c.isupper()))
COMPLETE_EMAIL_DF['content_cl_count'] = COMPLETE_EMAIL_DF['Words'].apply(lambda x: sum(1 for c in x if c.isupper()))

Getting the character count and digit count of the from address, as these could flag possible spam.

In [26]:
COMPLETE_EMAIL_DF['from_ch_count'] = COMPLETE_EMAIL_DF['From'].apply(len)  # every character counts, so this is just the length
COMPLETE_EMAIL_DF['from_int_count'] = COMPLETE_EMAIL_DF['From'].apply(lambda x: sum(1 for c in x if c.isdigit()))

Creating a boolean field for whether the from address ends in com, edu, or us.

In [27]:
COMPLETE_EMAIL_DF['from_dotcom'] = COMPLETE_EMAIL_DF['From_norm'].apply(lambda x: x.endswith("com"))
COMPLETE_EMAIL_DF['from_dotedu'] = COMPLETE_EMAIL_DF['From_norm'].apply(lambda x: x.endswith("edu"))
COMPLETE_EMAIL_DF['from_dotus'] = COMPLETE_EMAIL_DF['From_norm'].apply(lambda x: x.endswith("us"))
In [28]:
COMPLETE_EMAIL_DF['is_spam'] = COMPLETE_EMAIL_DF['Type']

Dropping the raw columns to get the final dataset for our EDA.

In [29]:
data_final = COMPLETE_EMAIL_DF.drop(['To','From','Subject','Type'], axis=1)
In [30]:
data_final
Out[30]:
Words Words_norm Subject_norm To_norm From_norm spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus is_spam
0 Return-Path: \nDelivered-To: yyyy@localhost.ne... returnpath deliveredto yyyy localhostnetnotein... new sequence window chris garrigues cwgdated10303144687c7c85 deepe... robert elz kre munnariozau 302 1 4 6 382 30 0 False False False 0
1 Return-Path: Fool@motleyfool.com\nDelivery-Dat... returnpath fool motleyfoolcom deliverydate wed... personal finance resolutions keep mkettler homecom motley fool fool motleyfoolcom 317 2 6 4 718 37 0 True False False 0
2 From exmh-workers-admin@redhat.com Wed Aug 21... exmhworkersadmin redhatcom wed aug 21 161835 2... new sequence window valdiskletnieks vtedu chris garrigues cwgexmh deepeddycom 117 1 4 5 409 39 0 True False False 0
3 Return-Path: \nDelivered-To: rod@arsecandle.or... returnpath deliveredto rod arsecandleorg recei... malcolm middle sweepstakes prize notification rod3ds arsecandleorg malcolmsweeps mrichicom 535 0 5 0 1520 25 0 True False False 0
4 From exmh-workers-admin@redhat.com Wed Aug 21... exmhworkersadmin redhatcom wed aug 21 161836 2... new sequence window robert elz kre munnariozau chris garrigues cwgexmh deepeddycom 134 1 4 5 423 39 0 True False False 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2393 From cna@insiq.us Tue Oct 8 00:10:39 2002\nR... cna insiqus tue oct 8 001039 2002 returnpath d... hit road cna zzzz jmasonorg iq cna cna insiqus 316 2 5 5 963 23 0 False False True 1
2394 From bounce2@u-answer.com Tue Oct 8 11:02:30... bounce2 uanswercom tue oct 8 110230 2002 retur... 10 hour watch emmercials joke undisclosedrecipients answerus davicomcokr 97 2 1 2 137 23 0 False False False 1
2395 From beautyinfufuxxxmeb13mxy@aol.com Tue Oct ... beautyinfufuxxxmeb13mxy aolcom tue oct 8 11023... make fortune ebay 24772 mike dogmaslashnullorg beautyinfufuxxxmeb13mxy aolcom 107 3 4 0 134 31 2 True False False 1
2396 From evtwqmigru@datcon.co.uk Tue Oct 8 11:02... evtwqmigru datconcouk tue oct 8 110237 2002 re... faeries wciml chezcom time evtwqmigru datconcouk 890 0 1 2 1700 36 0 False False False 1
2397 mv 00001.7848dde101aa985090474a91ec93fcf0 0000... mv 000017848dde101aa985090474a91ec93fcf0 00001... none none none 0 0 1 1 0 4 0 False False False 1

9338 rows × 16 columns

Exploratory Data Analysis

After creating the final dataset, we verified there were no missing values.

In [31]:
data_final.isnull().sum()
Out[31]:
Words                 0
Words_norm            0
Subject_norm          0
To_norm               0
From_norm             0
spam_count_content    0
spam_count_subject    0
subject_cl_count      0
from_cl_count         0
content_cl_count      0
from_ch_count         0
from_int_count        0
from_dotcom           0
from_dotedu           0
from_dotus            0
is_spam               0
dtype: int64

Displaying the summary of our dataset. Several count attributes have maximum values far above their means and medians, indicating a heavily right-skewed dataset.

In [32]:
data_final.describe()
Out[32]:
spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count is_spam
count 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000
mean 120.401906 1.439602 4.982652 2.017027 389.711501 34.314093 0.845256 0.256800
std 165.282288 1.414154 5.453663 2.834296 2548.711998 12.149288 2.491055 0.436892
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 63.000000 0.000000 2.000000 0.000000 148.000000 27.000000 0.000000 0.000000
50% 87.000000 1.000000 4.000000 2.000000 208.000000 33.000000 0.000000 0.000000
75% 123.000000 2.000000 6.000000 2.000000 295.000000 40.000000 0.000000 1.000000
max 3748.000000 28.000000 81.000000 32.000000 116506.000000 144.000000 45.000000 1.000000

Histograms of our continuous variables let us visually observe the skewness within the dataset.

In [33]:
columns = list(data_final.select_dtypes('int64'))
data_final[columns].hist(stacked=False, bins=100, figsize=(15,15), layout=(5,3)); 
In [34]:
data_final.skew()
Out[34]:
spam_count_content     9.379257
spam_count_subject     1.747891
subject_cl_count       3.572663
from_cl_count          3.644891
content_cl_count      31.577371
from_ch_count          1.582716
from_int_count         4.158932
from_dotcom           -0.006855
from_dotedu            5.635035
from_dotus            11.680077
is_spam                1.113557
dtype: float64

Looking at the mean of both spam-count columns, we can see that each mean is higher for spam emails (1) than for ham emails (0).

In [35]:
data_final.groupby(['is_spam'])['spam_count_content'].mean()
Out[35]:
is_spam
0    111.692507
1    145.607590
Name: spam_count_content, dtype: float64
In [36]:
data_final.groupby(['is_spam'])['spam_count_subject'].mean()
Out[36]:
is_spam
0    1.168300
1    2.224771
Name: spam_count_subject, dtype: float64

We identified an outlier within the spam word count of the subject line which needs to be addressed before moving forward. Since this affects only one record, we will remove it rather than capping it at the boxplot's top whisker (Q3 + 1.5 × IQR, not to be confused with the 75th percentile, which is the top of the box).

In [37]:
sns.boxplot(x="is_spam", y="spam_count_subject", data=data_final)
Out[37]:
<AxesSubplot:xlabel='is_spam', ylabel='spam_count_subject'>

After removing the outliers (subject spam-word counts of 20 or more and content counts of 300 or more, in the following cells), it was easier to observe the magnitude of the difference between the spam and non-spam profiles: the spam-word count distributions for both subject and content sit higher for spam than for non-spam emails.

In [38]:
data_final = data_final[data_final['spam_count_subject'] < 20]
sns.boxplot(x="is_spam", y="spam_count_subject", data=data_final)
Out[38]:
<AxesSubplot:xlabel='is_spam', ylabel='spam_count_subject'>
In [39]:
data_final = data_final[data_final['spam_count_content'] < 300]
sns.boxplot(x="is_spam", y="spam_count_content", data=data_final)
Out[39]:
<AxesSubplot:xlabel='is_spam', ylabel='spam_count_content'>

After eliminating the outlier observations from the dataset, the skewness was reduced significantly for those attributes.

In [40]:
data_final.skew()
Out[40]:
spam_count_content     1.288255
spam_count_subject     1.104251
subject_cl_count       3.611340
from_cl_count          3.616751
content_cl_count      31.300991
from_ch_count          1.565681
from_int_count         4.187774
from_dotcom            0.015974
from_dotedu            5.773160
from_dotus            12.483059
is_spam                1.145743
dtype: float64

The structure of our model entails aggregating spam-word frequencies and per-segment word counts, such as words in the subject line and words in the email body, to determine whether an email is spam.

In [41]:
model_data = data_final.select_dtypes(exclude = 'object')
In [42]:
model_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8891 entries, 2 to 2397
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   spam_count_content  8891 non-null   int64
 1   spam_count_subject  8891 non-null   int64
 2   subject_cl_count    8891 non-null   int64
 3   from_cl_count       8891 non-null   int64
 4   content_cl_count    8891 non-null   int64
 5   from_ch_count       8891 non-null   int64
 6   from_int_count      8891 non-null   int64
 7   from_dotcom         8891 non-null   bool 
 8   from_dotedu         8891 non-null   bool 
 9   from_dotus          8891 non-null   bool 
 10  is_spam             8891 non-null   int64
dtypes: bool(3), int64(8)
memory usage: 971.2 KB

There is distinct separation between the attributes of spam and non-spam emails. Our model will leverage these engineered features to execute the classification task.

In [43]:
fig = px.scatter_3d(model_data, x='spam_count_subject', y='from_ch_count', z='from_cl_count', color='is_spam', title="Separation of Spam")
fig.update_layout(width = 550, height = 550,margin=dict(l=0, r=0, b=0, t=0))
fig.show()

Model Building & Evaluation

K-Means Clustering:

K-Means clustering takes a user-specified number of clusters, k, and iteratively assigns each observation to the cluster whose mean (centroid) is nearest, recomputing the centroids until the assignments stabilize.
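Conceptually, each iteration alternates an assignment step and an update step; a minimal numpy sketch (independent of the scikit-learn call used below, and assuming no cluster goes empty):

def kmeans_step(X, centroids):
    """One Lloyd iteration: assign points to the nearest centroid, then recompute means."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n_points, k) distances
    labels = dists.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids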

In [44]:
x = model_data.iloc[:]  # .iloc selects by position, rows first then columns; here we keep everything (note this still includes is_spam)
x
Out[44]:
spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus is_spam
2 117 1 4 5 409 39 0 True False False 0
4 134 1 4 5 423 39 0 True False False 0
5 35 1 1 4 51 33 0 True False False 0
8 92 1 4 2 355 23 0 False True False 0
10 70 0 4 0 410 35 4 False False False 0
... ... ... ... ... ... ... ... ... ... ... ...
2391 190 3 4 1 340 29 9 True False False 1
2392 190 3 2 1 339 25 5 True False False 1
2394 97 2 1 2 137 23 0 False False False 1
2395 107 3 4 0 134 31 2 True False False 1
2397 0 0 1 1 0 4 0 False False False 1

8891 rows × 11 columns

Utilizing the elbow method, the optimal number of clusters was identified as 2. There does seem to be a second bend in the elbow at 4 clusters, but after evaluating the clusters we determined that 2 yielded better separation. We will utilize these cluster labels within our Naive Bayes model and assess model performance to determine their effectiveness.

In [45]:
sse = {}
# Fit KMeans and calculate SSE for each k
for k in range(1, 10):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)  
    # Fit KMeans on the (unscaled) dataset
    kmeans.fit(model_data)
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_
# Plotting the elbow plot
plt.figure(figsize=(12,8))
plt.title('The Elbow Method')
plt.xlabel('k'); 
plt.ylabel('Sum of squared errors')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
In [46]:
kmeans = KMeans(n_clusters=2, algorithm = 'full')
kmeans.fit(x)
Out[46]:
KMeans(algorithm='full', n_clusters=2)

Cluster the existing data

In [47]:
identified_clusters = kmeans.fit_predict(x)
identified_clusters
Out[47]:
array([0, 0, 0, ..., 0, 0, 0])
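As a quick sanity check (a sketch; its output was not recorded in the notebook), the cluster sizes can be inspected with a bincount:

print(np.bincount(identified_clusters))  # number of emails assigned to each of the 2 clusters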

Merge the clusters with the main dataset

In [48]:
data_with_clusters = model_data.copy()
data_with_clusters['Clusters'] = identified_clusters 
In [49]:
data_with_clusters
Out[49]:
spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus is_spam Clusters
2 117 1 4 5 409 39 0 True False False 0 0
4 134 1 4 5 423 39 0 True False False 0 0
5 35 1 1 4 51 33 0 True False False 0 0
8 92 1 4 2 355 23 0 False True False 0 0
10 70 0 4 0 410 35 4 False False False 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
2391 190 3 4 1 340 29 9 True False False 1 0
2392 190 3 2 1 339 25 5 True False False 1 0
2394 97 2 1 2 137 23 0 False False False 1 0
2395 107 3 4 0 134 31 2 True False False 1 0
2397 0 0 1 1 0 4 0 False False False 1 0

8891 rows × 12 columns

Naive Bayes:

Naive Bayes is a probabilistic classifier that applies Bayes' theorem, classifying each observation by the likelihood of its feature values occurring within each class, under the naive assumption that the features are conditionally independent.
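In a spam context, the model scores each class by its prior times the product of per-word likelihoods and picks the larger score. A toy sketch with hypothetical probabilities (not learned from our data):

# Hypothetical class priors (roughly our dataset's spam mix) and word likelihoods
p_spam, p_ham = 0.26, 0.74
p_word_spam = {'free': 0.05, 'meeting': 0.001}
p_word_ham = {'free': 0.005, 'meeting': 0.02}

email_words = ['free', 'meeting']
score_spam = p_spam * p_word_spam['free'] * p_word_spam['meeting']
score_ham = p_ham * p_word_ham['free'] * p_word_ham['meeting']
print('spam' if score_spam > score_ham else 'ham')  # 'ham' for these numbers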

Split the data into 75%/25% train and test sets.

In [50]:
X = data_with_clusters.drop('is_spam', axis=1)
y = data_with_clusters['is_spam']
X_train, X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)
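Given the class imbalance noted earlier, a stratified split with a fixed seed (an alternative we did not use here) would keep the spam ratio consistent across the splits and make results reproducible:

# Hypothetical alternative: preserve the class ratio and fix the random seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)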

Based on our EDA, our features are non-negative counts rather than normally distributed values, so we chose to implement the Multinomial Naive Bayes model.

In [51]:
clf=MultinomialNB()
In [52]:
clf.fit(X_train, y_train)
Out[52]:
MultinomialNB()
In [53]:
y_pred=clf.predict(X_test)

The model yielded an F1 score of 0.216. Since our objective is to maximize F1, this low score tells us the model struggles to flag spam, which the confusion matrix below confirms.

In [54]:
print('F1 Score: %.3f' % f1_score(y_test, y_pred))
F1 Score: 0.216
In [55]:
predictions = pd.DataFrame(y_pred, columns = ['Predictions'])
In [56]:
X_test = X_test.reset_index()
results = pd.merge(X_test, predictions, left_index=True, right_index=True)
In [57]:
results.head(25)
Out[57]:
index spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus Clusters Predictions
0 127 70 0 2 2 208 32 0 False False False 0 0
1 4476 39 3 5 0 86 41 0 False False False 0 0
2 6314 54 1 2 0 147 14 0 False False False 0 0
3 1271 119 5 7 4 276 33 2 True False False 0 0
4 1300 133 5 9 6 249 30 0 True False False 0 0
5 431 55 0 3 3 165 32 0 True False False 0 0
6 3163 78 0 15 3 201 51 0 False False False 0 0
7 5946 77 0 4 2 262 31 3 True False False 0 0
8 5783 81 0 2 2 552 37 0 False False False 0 1
9 6636 34 6 3 0 72 27 0 True False False 0 0
10 1718 92 5 11 2 200 36 0 True False False 0 0
11 5546 81 1 1 2 304 33 0 False False False 0 0
12 5623 89 1 2 6 264 30 0 False False False 0 0
13 3755 58 1 2 0 87 14 0 False False False 0 0
14 6821 47 1 7 0 106 32 0 True False False 0 0
15 6734 32 1 0 0 83 31 0 True False False 0 0
16 1168 88 1 5 3 201 39 0 True False False 0 0
17 1368 77 2 2 2 157 33 0 False False False 0 0
18 1625 103 5 11 1 281 24 0 False False False 0 0
19 5984 149 0 4 2 298 36 0 True False False 0 0
20 4827 94 1 3 2 262 36 0 True False False 0 0
21 801 71 1 2 3 157 31 0 False False False 0 0
22 2980 103 3 0 0 446 21 0 False False False 0 1
23 1726 91 3 5 1 192 33 6 True False False 0 0
24 1960 68 1 6 2 182 30 0 True False False 0 0

The confusion matrix displays the distribution of the classifications from the Naive Bayes model. Around 22% of the test data was spam misclassified as not spam (false negatives), whereas 73% was correctly predicted as not spam (true negatives).

In [58]:
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names = [0, 1]  # class labels: 0 = not spam, 1 = spam
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[58]:
Text(0.5, 352.48, 'Predicted label')
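The percentages quoted above can be read directly from a normalized matrix (a short sketch added for illustration):

cm_pct = cnf_matrix / cnf_matrix.sum()  # each cell as a fraction of the whole test set
print(np.round(cm_pct, 2))  # the [1, 0] cell is the share of spam predicted as ham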

Case Conclusions

Based on the classification performance of Naive Bayes, we believe this model is an appropriate starting point for the dataset. Spam detection will never be 100% accurate, but we hope the company in question implements this model to detect spam in its employees' inboxes. By looking at the features important to the models, we can also use these attributes to train new employees to recognize what is potentially spam.

The top four most important features were from_dotedu, our cluster labels from K-Means, from_dotus, and from_dotcom, which suggests the domain from which an email is sent has a real impact on classification.

In [59]:
viz = FeatureImportances(clf, relative=False)
viz.fit(X_train, y_train)
Out[59]:
FeatureImportances(ax=<AxesSubplot:>, estimator=MultinomialNB(), relative=False)

Further analysis

Since the model is not 100% accurate (misclassifying roughly 22% of the test set as not spam when it was actually spam), we could display emails that were not classified as spam but had a high predicted probability of being spam with a soft warning. If the user then designates such an email as spam or not, we could use this feedback to improve the models.
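A minimal sketch of that soft-warning idea, assuming a hypothetical probability threshold of 0.30:

# Probability that each test email is spam (column 1 of predict_proba)
spam_proba = clf.predict_proba(X_test[X_train.columns])[:, 1]
soft_warn = (spam_proba > 0.30) & (y_pred == 0)  # hypothetical threshold
print(f"{soft_warn.sum()} emails would get a soft spam warning")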